On Clustering and Evaluation of Narrow Domain Short-Text Corpora
Abstract
PhD thesis in Computer Science written by David Eduardo Pinto Avendaño under the supervision of Paolo Rosso (Univ. Politécnica de Valencia) and Héctor Jiménez (Univ. Autónoma Metropolitana, México). The author was examined in July 2008 in Valencia by the following committee: Manuel Palomar Sanz (Univ. de Alicante), Alfonso Ureña López (Univ. de Jaén), Eneko Agirre (Univ. del País Vasco), Benno Stein (Weimar Univ., Germany) and Encarna Segarra Soriano (Univ. Politécnica de Valencia). The grade obtained was Sobresaliente Cum Laude.
Related resources
Density-based clustering of short-text corpora (Agrupamiento de textos cortos basado en densidad)
In this work, we analyse the performance of different density-based algorithms on short-text and narrow-domain short-text corpora. We attempt to determine to what extent the features of this kind of corpus affect the density computation of the clusterings obtained, and how robust these algorithms are to the different complexity levels.
Characterizing Weblog Corpora
In order to exploit the huge volume of information being published in the blogosphere, it is essential to provide techniques such as clustering, which can automatically analyze and classify its contents. However, these techniques typically produce better results when dealing with wide-domain full-text documents. In most cases, however, blogs can be considered to be “short texts”, i.e., they are not ex...
Ontology-Based Document Clustering with a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...
Evaluation of Internal Validity Measures in Short-Text Corpora
Short-text clustering is one of the most difficult tasks in natural language processing due to the low frequencies of the document terms. We are interested in analysing this kind of corpus in order to develop novel techniques that may be used to improve the results obtained by classical clustering algorithms. In this paper we present an evaluation of different internal clustering validity...
Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance
Clustering short texts is a difficult task in itself, but adding the narrow-domain characteristic poses an additional challenge for current clustering methods. We addressed this problem with the use of a new measure of distance between documents which is based on the symmetric Kullback-Leibler distance. Although this measure is commonly used to calculate a distance between two probability d...
Journal title:
- Procesamiento del Lenguaje Natural
Volume 42
Publication date: 2009